MEDB 5505, Module 03

2025-02-01

Topics to be covered

  • What you will learn
    • Reading text files
    • Comma delimited files
    • Tab delimited files
    • Other delimiters
    • Fixed width files
    • Real world examples
    • Your programming assignment

Text files, 1

  • Advantages
    • Easy import into many programs
    • Review using Notepad
  • Disadvantages
    • Bigger size
    • Slower to import

Text files, 2

  • Wide range of formats
    • Delimited
    • Fixed width
  • First row for variable names
    • Optional but recommended
  • Always look for a data dictionary
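When a file has no header row, the variable names can be supplied in the call instead. The following is a minimal sketch, assuming a hypothetical two-column file written to a temporary location; the file contents and the names x and y are illustrative only.

```r
library(readr)

# Hypothetical file with no header row, written to a temporary location
no_header_file <- tempfile(fileext=".csv")
writeLines(c("1,4", "2,8", "3,12", "4,16"), no_header_file)

# Supply the variable names yourself, ideally taken from the data dictionary
no_header_data <- read_csv(
  file=no_header_file,
  col_names=c("x", "y"),
  col_types="nn")

names(no_header_data)
```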

Should I download before reading?

  • Read directly from website
    • Convenient
    • Updates incorporated at each run
  • Download then read
    • Downloaded file doesn’t disappear
    • Avoid repeated long downloads
    • Work even when Internet connection is down
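One way to get the benefits of "download then read" is to download only when no local copy exists. This is a sketch under stated assumptions: download_once is a hypothetical helper name (it is not part of readr), and the url and local file path are whatever you choose to pass in.

```r
# download_once is a hypothetical helper, not a readr function
download_once <- function(url, local_file) {
  # Download only when no local copy exists yet
  if (!file.exists(local_file)) {
    download.file(url=url, destfile=local_file, mode="wb")
  }
  # Return the local path so it can be passed to read_csv, read_tsv, etc.
  local_file
}
```

The returned path can be used as the file= argument of any of the reading functions. Deleting the local copy forces a fresh download on the next run, which is how updates would be picked up under this approach.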

No data dictionary?

  • Peek at file
    • Check for the same number of delimiters on each line
    • Tabs versus multiple blanks are hard to distinguish
  • http://www.pmean.com/12/pesky.html
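Peeking can be made more systematic by counting delimiters on each line. The sketch below assumes a small hypothetical file in which the third line has an extra comma; str_count comes from the stringr package, and counting tabs the same way helps when tabs and blanks look alike on screen.

```r
library(readr)
library(stringr)

# Hypothetical file; the third line has one comma too many
peek_file <- tempfile(fileext=".txt")
writeLines(c("x,y", "1,4", "2,8,", "3,12"), peek_file)

sample_lines <- read_lines(file=peek_file, n_max=10)

# Count candidate delimiters on each line
comma_counts <- str_count(sample_lines, fixed(","))
tab_counts <- str_count(sample_lines, fixed("\t"))

# Lines whose comma count differs from the first line deserve a closer look
suspect_lines <- which(comma_counts != comma_counts[1])
suspect_lines
```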

No data dictionary?

  • Experiment
    • Read warnings carefully
  • If needed, edit the file manually
    • Simple edits of one or two offending lines
    • Global search and replace
    • Change tabs to blanks
    • Change multiple blanks to single blank
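When a read produces warnings, the problems function in readr lists the values that could not be parsed. This is a minimal sketch, assuming a hypothetical file with one non-numeric entry; col_types="dd" (double) is used here so that the bad entry is flagged rather than guessed around.

```r
library(readr)

# Hypothetical file with one value that is not a number
messy_file <- tempfile(fileext=".csv")
writeLines(c("x,y", "1,4", "2,eight", "3,12"), messy_file)

messy_data <- read_csv(
  file=messy_file,
  col_names=TRUE,
  col_types="dd")

# List each value that could not be parsed, with its row and column
problems(messy_data)
```

If only one or two lines are reported, a manual edit of those lines may be the simplest fix.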

Troubleshooting

  • Multiple variables read in as a single variable
  • Lots of missing values
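The first symptom usually points to the wrong delimiter. The sketch below assumes a hypothetical tab delimited file and reads it with both read_csv and read_tsv to show the difference in the number of variables.

```r
library(readr)

# Hypothetical tab delimited file
tab_file <- tempfile(fileext=".txt")
writeLines(c("x\ty", "1\t4", "2\t8"), tab_file)

# Wrong function: no commas are found, so each line becomes one variable
wrong_read <- read_csv(
  file=tab_file,
  col_names=TRUE,
  col_types=cols(.default=col_character()))
ncol(wrong_read)

# Right function: the tab separates x from y
right_read <- read_tsv(
  file=tab_file,
  col_names=TRUE,
  col_types="nn")
ncol(right_read)
```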

Break #1

  • What you have learned
    • Reading text files
  • What’s coming next
    • Comma delimited files

An example of a comma delimited file

x,y
1,4
2,8
3,12
4,16

The read_csv function

library(tidyverse)

# col_types="nn" declares both columns as numeric
raw_data <- read_csv(
  file="../data/simple.csv",
  col_names=TRUE,
  col_types="nn")

glimpse(raw_data)
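The compact string "nn" is shorthand for a full column specification. The sketch below, which assumes a temporary copy of the small x,y file shown earlier, reads the data both ways and checks that the results agree.

```r
library(readr)

# Temporary copy of the simple comma delimited example
simple_file <- tempfile(fileext=".csv")
writeLines(c("x,y", "1,4", "2,8", "3,12", "4,16"), simple_file)

# Compact form: one letter per column
compact_spec <- read_csv(
  file=simple_file,
  col_names=TRUE,
  col_types="nn")

# Explicit form: one entry per variable name
explicit_spec <- read_csv(
  file=simple_file,
  col_names=TRUE,
  col_types=cols(x=col_number(), y=col_number()))

identical(compact_spec$y, explicit_spec$y)
```

The explicit form is longer, but it documents which type belongs to which variable.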

Live demonstration, 1

Now, you will see a live demonstration of the program simon-5505-03-demo-02.

Break #2

  • What you have learned
    • Comma delimited files
  • What’s coming next
    • Tab delimited files

Tab delimited files

x   y
1   4
2   8
3   12
4   16

Using the read_tsv function

raw_data <- read_tsv(
  file="../data/simple.tsv",
  col_names=TRUE,
  col_types="nn")

glimpse(raw_data)

Live demonstration, 2

Now, you will see a live demonstration of the program simon-5505-03-demo-03.

Break #3

  • What you have learned
    • Tab delimited files
  • What’s coming next
    • Other delimiters

Anything can be a delimiter

x~y
1~4
2~8
3~12
4~16

Using the read_delim function with delim="~"

raw_data <- read_delim(
  file="../data/tilde.txt",
  delim="~",
  col_names=TRUE,
  col_types="nn")

glimpse(raw_data)

Live demonstration, 3

Now, you will see a live demonstration of the program simon-5505-03-demo-04.

Break #4

  • What you have learned
    • Other delimiters
  • What’s coming next
    • Fixed width files

Reading fixed width format files

1 4
2 8
312
416

The read_fwf function

# read_fwf has no col_names argument; names go inside fwf_cols
# x has width 1 and y has width 2
raw_data <- read_fwf(
  file="../data/fixed.txt", 
  col_positions=fwf_cols(x=1, y=2),
  col_types="nn")

glimpse(raw_data)

Helpful functions with read_fwf

  • fwf_empty()
    • Uses spacing to guess at column positions
  • fwf_widths()
    • Specifies column widths
  • fwf_positions()
    • Specifies start and end locations for each column
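The same layout can be described by widths or by start and end positions. This sketch assumes a temporary copy of the four-line fixed width example and reads it both ways.

```r
library(readr)

# Temporary copy of the fixed width example
fixed_file <- tempfile(fileext=".txt")
writeLines(c("1 4", "2 8", "312", "416"), fixed_file)

# Describe the layout by column widths
by_width <- read_fwf(
  file=fixed_file,
  col_positions=fwf_widths(
    widths=c(1, 2),
    col_names=c("x", "y")),
  col_types="nn")

# Describe the same layout by start and end positions
by_position <- read_fwf(
  file=fixed_file,
  col_positions=fwf_positions(
    start=c(1, 2),
    end=c(1, 3),
    col_names=c("x", "y")),
  col_types="nn")

identical(by_width$y, by_position$y)
```

Because fwf_empty() relies on blank space between columns, it would likely not separate x from y in this particular file, where 312 and 416 contain no blanks.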

Disadvantages of fixed width formatting?

  • Confusing
    • What is 312?
      • 3, 1, and 2?
      • 31 and 2?
      • 3 and 12?
      • 312?
  • More work
  • Prone to errors
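The ambiguity is easy to demonstrate: the same characters produce different data under different width assumptions. This is a small sketch using a hypothetical two-line file.

```r
library(readr)

# Hypothetical file with no blanks between values
ambiguous_file <- tempfile(fileext=".txt")
writeLines(c("312", "416"), ambiguous_file)

# Assume widths of 1 and 2: 312 is read as 3 and 12
one_then_two <- read_fwf(
  file=ambiguous_file,
  col_positions=fwf_widths(c(1, 2), col_names=c("x", "y")),
  col_types="nn")

# Assume widths of 2 and 1: 312 is read as 31 and 2
two_then_one <- read_fwf(
  file=ambiguous_file,
  col_positions=fwf_widths(c(2, 1), col_names=c("x", "y")),
  col_types="nn")
```

Neither call produces a warning, which is why a data dictionary matters so much for fixed width files.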

Example where fixed width formatting is needed.

Live demonstration, 4

Now, you will see a live demonstration of the program simon-5505-03-demo-05.

Break #5

  • What you have learned
    • Fixed width files
  • What’s coming next
    • Real world examples

Function arguments for advanced options

  • col_select=
  • na=
  • name_repair=
  • skip=
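These arguments can be sketched on a small hypothetical file. The assumptions here are all illustrative: a note on the first line, a variable name containing a blank, and -99 used as a missing value code.

```r
library(readr)

# Hypothetical file: a note on line 1, a name with a blank, -99 for missing
advanced_file <- tempfile(fileext=".csv")
writeLines(
  c("Hypothetical note: -99 means missing",
    "id,body weight,height",
    "1,70,-99",
    "2,-99,180"),
  advanced_file)

# skip= drops the note, na= recodes -99, name_repair= fixes the name
repaired <- read_csv(
  file=advanced_file,
  skip=1,
  na=c("", "NA", "-99"),
  name_repair="universal",
  col_types=cols(.default=col_double()))
names(repaired)

# col_select= keeps only the variables you ask for
selected <- read_csv(
  file=advanced_file,
  skip=1,
  na=c("", "NA", "-99"),
  col_select=c(id, height),
  col_types=cols(.default=col_double()))
names(selected)
```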

Example 1, binary.csv

Example 1, a brief description

Example 1, viewing the file in Notepad

Example 1, the code to peek at the data

url_binary <- "https://stats.idre.ucla.edu/stat/data/binary.csv"
read_lines(
  file=url_binary, 
  n_max=10)

Example 1, the code to read the data

example_binary <- read_csv(
  file=url_binary,
  col_names=TRUE,
  col_types="nnnn")

glimpse(example_binary)

Example 2, barbershop-music.txt

Example 2, viewing the file in Notepad

Example 2, the code to peek at the data

url_barbershop <- "https://dasl.datadescription.com/download/data/3061"
read_lines(
  file=url_barbershop,
  n_max=10)

 [1] "Singing\tPerformance\tMusic"
 [2] "151\t143\t138"              
 [3] "152\t146\t136"              
 [4] "146\t143\t140"              
 [5] "146\t147\t142"              
 [6] "145\t141\t134"              
 [7] "144\t139\t140"              
 [8] "133\t138\t132"              
 [9] "129\t135\t128"              
[10] "134\t125\t132"

Example 2, the code to read the data

raw_data <- read_tsv(
  file=url_barbershop,
  col_names=TRUE,
  col_types="nnn")

glimpse(raw_data)

Example 3, airport.txt

Example 3, peeking at the file on the web

Example 3, a description of the data

  • Here is an excerpt from the data dictionary.

VARIABLE DESCRIPTIONS:
Airport                               Columns 1-21
City                                  Columns 22-43 
Scheduled departures                  Columns 44-49 
Performed departures                  Columns 51-56
Enplaned passengers                   Columns 58-65
Enplaned revenue tons of freight      Columns 67-75
Enplaned revenue tons of mail         Columns 77-85

Example 3, the code to peek at the data

url_airport <- "http://jse.amstat.org/datasets/airport.dat.txt"
read_lines(
  file=url_airport,
  n_max=10)

Example 3, Defining variable names and column locations

start_column <- c( 1, 22, 44, 51, 58, 67, 77)
end_column <-   c(21, 43, 49, 56, 65, 75, 85)
variable_names <- c(
  "airport",
  "city",
  "scheduled_departures",
  "performed_departures",
  "enplaned_passengers",
  "enplaned_freight",
  "enplaned_mail")

Example 3, the code to read the data

# Variable names go inside fwf_positions; read_fwf has no col_names argument
example_3 <- read_fwf(
  file=url_airport, 
  col_positions=fwf_positions(
    start=start_column, 
    end=end_column,
    col_names=variable_names),
  col_types="ccnnnnn")

glimpse(example_3)
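Typing column locations by hand is error prone, so a quick arithmetic check of the positions in Example 3 may be worthwhile. This sketch repeats the start and end columns from the data dictionary and is only one possible set of checks.

```r
# Start and end columns copied from the data dictionary excerpt
start_column <- c( 1, 22, 44, 51, 58, 67, 77)
end_column <-   c(21, 43, 49, 56, 65, 75, 85)

# Width of each field implied by the data dictionary
field_widths <- end_column - start_column + 1

# Unused columns between consecutive fields; a negative value means overlap
gaps <- start_column[-1] - end_column[-length(end_column)] - 1

stopifnot(
  length(start_column) == length(end_column),
  all(field_widths > 0),
  all(gaps >= 0))

field_widths
gaps
```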

Break #6

  • What you have learned
    • Real world examples
  • What’s coming next
    • Your programming assignment

This programming assignment was written by Steve Simon on 2024-12-18 and is placed in the public domain.

Program

  • Download the xx program
    • Store it in your src folder
  • Modify the file names
    • Use your last name instead of “simon”
  • Modify the documentation headers
    • Add your name
    • Optional: change the copyright statement

Data

Question 1

xx

Question 2

xx

Grading rubric

You will be evaluated using the general grading rubric for programming assignments.

Your submission

  • Save the output in html format
  • Convert it to pdf format
  • Make sure that the pdf file includes
    • Your last name
    • The number of this course
    • The number of this module
  • Upload the file

If it doesn’t work

Please review the suggestions if you encounter an error page.

Summary

  • What you have learned
    • Reading text files
    • Comma delimited files
    • Tab delimited files
    • Other delimiters
    • Fixed width files
    • Real world examples
    • Your programming assignment